HAD Phase 4.5 C: linearity-family pretests under survey (#370)
Conversation

AI review -- Overall Assessment / Executive Summary / Methodology / Code Quality / Performance / Maintainability / Tech Debt / Security / Documentation/Tests / Path to Approval
Static review only: the test suite could not be run in this sandbox because the available Python environment is missing project dependencies.
R1 P0 (Methodology) -- the Stute survey path silently accepted zero-weight units, which leak into the dose-variation check, the CvM cusum, and the bootstrap refit while contributing zero population mass. Extreme case: only zero-weight units carry dose variation -> spurious finite test statistic with no warning. Fix: strictly-positive guards on every survey-aware Stute / Yatchew / workflow entry point (the weights= shortcut already had this; the survey= branch was the gap).
R1 P1 #1 -- aweight/fweight survey designs slipped silently through pweight-only formulas (the variance components are derived assuming pweight sandwich semantics). Fix: weight_type='pweight' guards added in _resolve_pretest_unit_weights and on every direct-helper survey= branch (stute_test, yatchew_hr_test, stute_joint_pretest). Mirrors the HAD.fit guard at had.py:2976 and survey._resolve_pweight_only at survey.py:914.
R1 P1 #2 -- the workflow's row-level weights= crashed on staggered event-study panels because _validate_multi_period_panel filters to the last cohort but the joint wrappers re-aggregate with the original full-panel weights array. Fix: subset joint_weights to data_filtered's rows via data.index.get_indexer(data_filtered.index) BEFORE passing to the wrappers. Mirrors the HeterogeneousAdoptionDiD.fit positional-index pattern. The survey= path is unaffected (column references resolve internally on data_filtered).
R1 P3 -- the REGISTRY C0 note still said "the same gate applies to did_had_pretest_workflow" and "Phase 4.5 C uses Rao-Wu rescaling"; both are stale post-C. Updated to clarify that (a) the workflow gate was temporary and is now closed by C, (b) the qug_test direct-helper gate remains permanent, and (c) C uses a PSU-level Mammen multiplier bootstrap (NOT Rao-Wu rescaling).
7 new tests in TestPhase45CR1Regressions covering: zero-weight survey on stute_test / stute_joint_pretest / workflow; aweight rejection on stute_test / workflow; fweight rejection on yatchew_hr_test; staggered event-study workflow with weights= (catches the length-mismatch crash).
165 pretest tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
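The R1 P1 #2 positional-subsetting fix can be sketched as follows (a minimal illustration; `subset_row_weights` and the toy column names are hypothetical stand-ins, not the shipped helper):

```python
import numpy as np
import pandas as pd

def subset_row_weights(data: pd.DataFrame, data_filtered: pd.DataFrame,
                       weights: np.ndarray) -> np.ndarray:
    """Align a full-panel row-level weights array with a filtered view.

    Positions of the surviving rows are looked up in the ORIGINAL index,
    so the weights stay row-aligned after the staggered last-cohort filter.
    """
    pos_idx = data.index.get_indexer(data_filtered.index)
    return np.asarray(weights)[pos_idx]

# Toy staggered panel: the 2004 cohort is dropped by the last-cohort filter.
data = pd.DataFrame({"unit": [1, 1, 2, 2], "cohort": [2004, 2004, 2006, 2006]})
weights = np.array([0.5, 0.5, 2.0, 2.0])
data_filtered = data[data["cohort"] == 2006]

aligned = subset_row_weights(data, data_filtered, weights)
print(aligned.tolist())  # -> [2.0, 2.0]
```

Passing the full four-entry array straight into a helper that expects the filtered row count is exactly the length-mismatch crash the regression test pins.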
/ai-review

🔁 AI review rerun (requested by @igerber)
R2 P1 #1 (Code Quality) -- joint_pretrends_test and joint_homogeneity_test direct calls still crashed on staggered panels because the staggered-weights subset fix from R1 was applied only at the workflow level. The wrappers run their own _validate_had_panel_event_study() and may filter to data_filtered, then passed the original full-panel weights array to _resolve_pretest_unit_weights(data_filtered, ...), which expects the filtered row count. Fix: subset row-level weights to data_filtered.index positions (via data.index.get_indexer) BEFORE _resolve_pretest_unit_weights, mirroring the workflow fix.
R2 P1 #2 (Methodology) -- the REGISTRY note documented the bootstrap perturbation as `dy_b = fitted + eps * w * eta_obs`, but the code does `dy_b = fitted + eps * eta_obs` (no `* w`). The code is correct: the paper's Appendix D wild bootstrap perturbs UNWEIGHTED residuals; weighting flows through the OLS refit and the weighted CvM, not through the perturbation. Adding `* w` would over-weight by w². Fix: update the REGISTRY note to remove the spurious `* w` and clarify the canonical form. Add a regression that pins (a) bit-exact cvm_stat reduction at uniform weights and (b) bootstrap p-value distributional agreement within Monte-Carlo noise.
R2 P3 -- in-code docstrings still referenced the pre-Phase-4.5-C contract:
- the qug_test docstring said survey-aware Stute "admits a Rao-Wu rescaled bootstrap" (a PSU-level Mammen multiplier bootstrap is what shipped). Updated to reflect the correct mechanism.
- the HADPretestReport.all_pass docstring described the unweighted contract only; the survey/weights path drops the QUG-conclusiveness gate (linearity-conditional admissibility per the C0 deferral). Updated.
3 new regression tests in TestPhase45CR1Regressions:
- test_joint_pretrends_test_staggered_weights_subset
- test_joint_homogeneity_test_staggered_weights_subset
- test_stute_survey_perturbation_does_not_double_weight (locks the perturbation form via cvm_stat bit-exact reduction + a p-value MC bound)
168 pretest tests pass (was 165 after R1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
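The R2 P1 #2 perturbation contract can be illustrated numerically (toy sketch; the multiplier draw here is a Rademacher stand-in rather than the shipped Mammen helper, and none of these names are the real internals):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
fitted = np.zeros(n)
eps = rng.normal(size=n)                 # unweighted residuals
w = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # survey weights
eta = rng.choice([-1.0, 1.0], size=n)    # stand-in multiplier draw

# Canonical Appendix-D form: perturb the UNWEIGHTED residuals.
dy_correct = fitted + eps * eta
# The documented-but-wrong form scales the perturbation by w as well.
dy_wrong = fitted + eps * w * eta

# With E[eta] = 0 and Var(eta) = 1, the conditional bootstrap variance
# per unit is eps**2 (correct) vs (w * eps)**2 (wrong): an extra w**2.
var_correct = eps**2
var_wrong = (w * eps)**2
print(np.allclose(var_wrong / var_correct, w**2))  # -> True
```

The weights still enter, but only through the downstream weighted OLS refit and the weighted CvM, never through the residual perturbation itself.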
🔁 AI review rerun (requested by @igerber) -- Overall Assessment: ⛔ Blocker

/ai-review
R3 P0 (Methodology) -- variance-unidentified survey-design guard. generate_survey_multiplier_weights_batch returns an all-zero (B, n_psu) multiplier matrix when n_psu - n_strata <= 0 (e.g. an unstratified single-PSU design where df_survey = n_psu - 1 = 0). Without a guard, the Stute survey path treated the zero perturbations as a valid bootstrap law and emitted p_value = 1/(B+1) for any positive observed CvM (a spurious rejection on a survey edge case). Fix: detect df_survey <= 0 BEFORE calling the multiplier helper and return a NaN p_value with a UserWarning, matching the broader package convention for variance-unidentified survey designs (mirrors the compute_survey_vcov NaN treatment and ResolvedSurveyDesign.df_survey semantics). Applied in stute_test (single-horizon) and stute_joint_pretest (joint variant); the data-in wrappers and the workflow propagate via the same calls.
R3 P3 -- in-code docstrings had drifted on three surfaces:
- qug_test's NotImplementedError text said survey-aware Stute is "planned" and "admits a Rao-Wu rescaled bootstrap"; both are stale post-Phase-4.5-C. Updated to reflect the shipped reality (PSU-level Mammen multiplier bootstrap; the QUG step is skipped under survey/weights in the workflow).
- the joint_pretrends_test, joint_homogeneity_test, and stute_joint_pretest Parameters sections did not document the new weights= / survey= kwargs at all. Added with the per-test contract (per-unit aggregation, staggered subsetting, replicate/aweight/fweight rejection, variance-unidentified guard).
3 new regression tests in TestPhase45CR1Regressions:
- test_stute_test_single_psu_unstratified_returns_nan
- test_stute_joint_pretest_single_psu_unstratified_returns_nan
- test_workflow_single_psu_propagates_nan_through_stute (also documents Yatchew's PSU-agnostic-by-design behavior; the verdict carries the linearity-conditional suffix)
171 pretest tests pass (was 168 after R2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
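A minimal sketch of the R3 guard, with hypothetical function names standing in for the shipped stute_test internals:

```python
import warnings

def survey_df(n_psu: int, n_strata: int) -> int:
    """Survey design degrees of freedom: n_psu - n_strata."""
    return n_psu - n_strata

def guarded_bootstrap_pvalue(n_psu: int, n_strata: int, run_bootstrap) -> float:
    """Bail out with NaN BEFORE the multiplier helper would be asked to
    produce an all-zero (B, n_psu) matrix (illustrative only)."""
    if survey_df(n_psu, n_strata) <= 0:
        warnings.warn(
            "survey design has df_survey <= 0; bootstrap variance is "
            "unidentified, returning NaN p-value", UserWarning)
        return float("nan")
    return run_bootstrap()

# Unstratified single-PSU design: df_survey = 1 - 1 = 0 -> NaN, no bootstrap.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    p = guarded_bootstrap_pvalue(1, 1, lambda: 0.04)
print(p != p, len(caught))  # -> True 1  (NaN p-value, one UserWarning)
```

Without the early return, zero perturbations make every bootstrap replicate equal to zero, so any positive observed statistic beats all B replicates and the p-value collapses to 1/(B+1).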
🔁 AI review rerun (requested by @igerber) -- Overall Assessment: ⛔ Blocker

/ai-review
R4 P0 (Methodology) -- the Yatchew test statistic was not invariant to uniform pweight rescaling. The formula `T_hr = sqrt(sum(w)) * (...)` makes T_hr scale as sqrt(c) under weights -> c * w, so weights=w and weights=100*w produced different p-values for the same design. Worse, SurveyDesign.resolve() normalizes pweights to mean=1 internally, so the survey= entry path and the weights= shortcut disagreed numerically. Fix: normalize per-unit pweights to mean=1 at every helper entry (stute_test, yatchew_hr_test, stute_joint_pretest) and at the workflow resolution helper. Matches the SurveyDesign.resolve() convention; makes the Yatchew statistic scale-invariant; ensures weights=w and survey=SurveyDesign(weights="w") produce identical results for the same design. Stute is internally scale-invariant in functional form, but normalization is still required so the bootstrap helper sees the same weight vector under both entry paths (cross-path numerical agreement).
R4 P1 (Code Quality) -- column-vector weights (e.g. `df[["w"]].to_numpy()` producing shape (G, 1)) silently broadcast through the weighted moments / CvM sums instead of raising. Fix: validate via `_validate_1d_numeric` on all `weights=` arrays in stute_test, yatchew_hr_test, and stute_joint_pretest; add an explicit ndim check in `_resolve_pretest_unit_weights` with a hint about the common df[["w"]].to_numpy() mistake.
6 new regression tests in TestPhase45CR1Regressions:
- test_yatchew_weights_scale_invariant (weights=w vs weights=100*w)
- test_stute_weights_scale_invariant (mirror for Stute)
- test_workflow_weights_eq_survey_at_overall_path (the weights= shortcut and survey=SurveyDesign(...) produce identical Yatchew + Stute results, atol=1e-10)
- test_stute_test_rejects_2d_weights / test_yatchew_hr_test_rejects_2d_weights / test_workflow_rejects_2d_weights (column-vector rejection at all three direct-helper / workflow entry points)
177 pretest tests pass (was 171 after R3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
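The R4 normalization argument in miniature (toy statistic only; `t_hr_like` is illustrative and not the real Yatchew formula):

```python
import numpy as np

def normalize_mean_one(w: np.ndarray) -> np.ndarray:
    """Normalize pweights to mean 1 (the SurveyDesign.resolve convention)."""
    w = np.asarray(w, dtype=float)
    return w / w.mean()

def t_hr_like(w: np.ndarray, a: float, b: float) -> float:
    """Toy stand-in: any statistic of the shape sqrt(sum(w)) * (...)
    scales as sqrt(c) under w -> c*w unless w is normalized first."""
    return np.sqrt(w.sum()) * (a - b)

w = np.array([1.0, 2.0, 3.0])
# Raw statistic: scaling w by 100 scales T by sqrt(100) = 10.
raw_gap = t_hr_like(100 * w, 0.5, 0.3) / t_hr_like(w, 0.5, 0.3)
# After mean-1 normalization, both scalings agree exactly.
norm_same = np.isclose(
    t_hr_like(normalize_mean_one(100 * w), 0.5, 0.3),
    t_hr_like(normalize_mean_one(w), 0.5, 0.3))
print(round(raw_gap, 6), bool(norm_same))  # -> 10.0 True
```

Normalizing at every entry point also pins down cross-path agreement: weights=w and survey=SurveyDesign(weights="w") both hand the kernel the same mean-1 vector.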
🔁 AI review rerun (requested by @igerber)

/ai-review
R5 P1 (Methodology) -- the Stute survey df_survey<=0 guard was too broad. generate_survey_multiplier_weights_batch's pooled-singleton branch for lonely_psu='adjust' produces NONZERO multipliers, so the helper contract is non-degenerate even when df_survey=0 (one PSU per stratum). The previous guard rejected those designs as variance-unidentified when the issue is actually methodological: the analytical variance target requires a pseudo-stratum centering transform that has not been derived for the Stute CvM. Fix: mirror the HAD sup-t bootstrap's explicit lonely_psu='adjust' rejection at had.py:2081-2118. Add a new helper _has_lonely_psu_adjust_singletons in had_pretests.py. stute_test and stute_joint_pretest now reject lonely_psu='adjust' with singleton strata via NotImplementedError BEFORE the df_survey<=0 guard. The guard remains for genuinely degenerate designs (single-PSU unstratified, or one-PSU-per-stratum under remove/certainty, where the helper returns all-zero multipliers).
R5 P2 -- new regressions cover lonely_psu='adjust' rejection on both the single-horizon and joint-Stute helpers, plus a positive control showing lonely_psu='remove' singletons still flow through the existing df_survey-based NaN path. 3 new tests in TestPhase45CR1Regressions. The REGISTRY note clarifies the carve-out: 'adjust' with singletons raises NotImplementedError; 'remove' / 'certainty' produce all-zero singleton multipliers and return NaN via the variance-unidentified guard.
180 pretest tests pass (was 177 after R4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
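A sketch of the singleton-stratum detection the R5 guard needs (this reconstruction of what _has_lonely_psu_adjust_singletons checks is a guess at the logic, not the shipped code):

```python
import numpy as np

def has_singleton_strata(strata: np.ndarray, psu: np.ndarray) -> bool:
    """A stratum is a singleton when it contains exactly one distinct PSU
    (illustrative reconstruction)."""
    for s in np.unique(strata):
        if len(np.unique(psu[strata == s])) == 1:
            return True
    return False

strata = np.array([1, 1, 2, 2])
psu = np.array([10, 11, 20, 20])   # stratum 2 holds a single PSU
print(has_singleton_strata(strata, psu))  # -> True

# The R5 guard ordering, schematically:
#   if lonely_psu == 'adjust' and has_singleton_strata(strata, psu):
#       raise NotImplementedError(...)  # pseudo-stratum centering underived
#   elif df_survey <= 0:
#       ...return NaN p-value           # genuinely variance-unidentified
```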
🔁 AI review rerun (requested by @igerber)

/ai-review
R6 P1 #1 (Code Quality) -- did_had_pretest_workflow eagerly resolved weights/survey on the FULL panel before _validate_multi_period_panel applied the staggered last-cohort filter. Because _resolve_pretest_unit_weights enforces strictly-positive per-unit weights / pweight type / etc. on whatever data it sees, zero or otherwise-invalid weights on the soon-to-be-dropped cohort would abort an otherwise-valid event-study run. Fix: defer resolution to the per-aggregate branches.
- Top-level: only the survey/weights mutex check + use_survey_path presence detection (no resolution).
- Overall path: resolve weights/survey AFTER _validate_had_panel (no cohort filtering on this path; the original data IS the panel).
- Event-study path: do NOT resolve at the workflow level. The joint wrappers (joint_pretrends_test / joint_homogeneity_test) own resolution and already see data_filtered (post staggered filter). Row-level weights= is passed through with the existing positional subsetting (R1 P1 fix preserved).
R6 P1 #2 (Documentation/Tests) -- positive PSU/strata survey coverage gap. Existing tests covered the overall workflow plus trivial/no-PSU smokes; the PSU-aware multiplier-bootstrap path (the core new methodology) was unpinned for joint_homogeneity_test and the event-study workflow.
3 new regression tests in TestPhase45CR1Regressions:
- test_joint_homogeneity_test_psu_strata_survey_smoke (non-trivial SurveyDesign(weights=, strata=, psu=) on the linearity wrapper).
- test_workflow_event_study_psu_strata_survey_smoke (full event-study dispatch under PSU/strata clustering: validate_multi_period_panel + resolve on data_filtered + pretrends_joint + homogeneity_joint).
- test_workflow_event_study_zero_weights_on_dropped_cohort (regression for the R6 P1 #1 fix: a panel where the dropped early cohort has zero weights succeeds on the surviving last cohort; pre-fix this crashed with "weights must be strictly positive").
183 pretest tests pass (was 180 after R5).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber) -- Overall Assessment: ⛔ Blocker. One new P0 methodology bug in the survey-weighted Stute statistic, plus one P1 report-contract regression. The two P1s from the previous review appear fixed.

/ai-review
R7 P0 (Methodology) -- weighted CvM was missing the outer measure
factor. Survey-weighted plug-in of Stute's CvM functional integrates
the squared cusum process against the survey-weighted EDF
F_hat_w = (1/W) sum_i w_i delta_{D_i}, which weights BOTH the inner
cusum AND the outer integration measure:
C_g = sum_{h <= g} w_h * eps_h
S_w = (1/W^2) * sum_g w_g * (C_g)^2
Earlier revisions used (1/W^2) * sum_g C_g^2 (no outer w_g) which is
a count-weighted-cusum / uniform-outer-measure functional and silently
misreports survey-weighted Stute statistics for non-uniform weights.
At w=ones(G) both forms reduce to (1/G^2) sum_g C_g^2; only non-uniform
weights distinguish them, which is why the prior reduction tests
didn't catch this.
Fix: add the outer w_sorted factor to _cvm_statistic_weighted. New
oracle test pins the formula on a hand-computed non-uniform-weight
example (w=[1,2,3], eps=[1,-2,3] -> 127/36 outer-weighted, NOT 46/36
count-weighted-cusum). Reduction at w=1 still bit-exact.
R7 P1 (Code Quality) -- survey verdict could leave pass cases starting
with "inconclusive". The previous approach composed the unweighted
verdict against a synthetic NaN QUG and string-replaced "QUG NaN" out;
when no rejections fired and linearity was conclusive, the underlying
"inconclusive - QUG NaN" template would collapse to "inconclusive"
even for all_pass=True paths.
Fix: explicit survey-aware verdict composers
_compose_verdict_overall_survey and _compose_verdict_event_study_survey
that drop QUG from consideration entirely and emit linearity-only
priority text. Both append the
"(linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0)"
suffix in every branch (rejections, inconclusive, fail-to-reject).
4 new regression tests:
- test_cvm_statistic_weighted_outer_measure_oracle (R7 P0 hand-computed)
- test_cvm_statistic_weighted_reduces_at_uniform_weights (R7 P0 reduction)
- test_workflow_overall_survey_pass_does_not_say_inconclusive (R7 P1)
- test_workflow_event_study_survey_pass_does_not_say_inconclusive (R7 P1)
187 pretest tests pass (was 183 after R6).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
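The corrected functional can be sketched directly from the formulas above (an illustrative reconstruction of _cvm_statistic_weighted assuming dose-sorted inputs; not the shipped code):

```python
import numpy as np

def cvm_statistic_weighted(w: np.ndarray, eps: np.ndarray) -> float:
    """S_w = (1/W^2) * sum_g w_g * C_g^2 with C_g = sum_{h<=g} w_h * eps_h.

    The outer w_g is the survey-weighted integration measure; omitting it
    yields the count-weighted-cusum functional the R7 P0 fix replaced.
    """
    w = np.asarray(w, dtype=float)
    W = w.sum()
    C = np.cumsum(w * eps)                  # inner weighted cusum
    return float((w * C**2).sum() / W**2)   # outer survey-weighted measure

# Hand-computed oracle from the review round:
w, eps = np.array([1.0, 2.0, 3.0]), np.array([1.0, -2.0, 3.0])
print(np.isclose(cvm_statistic_weighted(w, eps), 127 / 36))  # -> True
# (count-weighted form would give 46/36 instead)

# Uniform-weight reduction to (1/G^2) * sum_g C_g^2:
ones = np.ones(3)
print(np.isclose(cvm_statistic_weighted(ones, eps),
                 (np.cumsum(eps) ** 2).sum() / 9))  # -> True
```

The uniform reduction is exactly why the earlier reduction tests could not catch the missing factor: at w = ones(G) the two functionals coincide.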
🔁 AI review rerun (requested by @igerber) -- Overall Assessment: ✅ Looks good. The prior P0 weighted-CvM issue and the prior P1 survey-verdict regression both appear fixed; no unmitigated P0/P1 issues found in the changed code (static review only).
R8 P3 -- survey all_pass docstring/REGISTRY described the contract too
generically ("at least one linearity test conclusive"), which matches
the overall path but is looser than the implemented event-study path
(which requires BOTH pretrends_joint AND homogeneity_joint conclusive
+ non-rejecting).
Fix: split the survey all_pass description by aggregate in both
HADPretestReport docstring and REGISTRY note:
- overall: "at least one of Stute/Yatchew conclusive + no rejection"
(mirrors paper Section 4 step-3 'Stute OR Yatchew' wording).
- event_study: "both joint variants conclusive + neither rejects"
(same step-2 + step-3 closure as the unweighted aggregate, minus the
QUG step).
Code unchanged; only documentation. 187 pretest tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

🔁 AI review rerun (requested by @igerber)
R9 P1 (Code Quality) -- staggered weights= subsetting bypassed the
standard length check. The R6 P1 fix added pos_idx-based subsetting
to align row-level weights with data_filtered, but the subsetting
ran without first validating that weights had length matching the
ORIGINAL data:
- Oversized arrays were silently truncated (pos_idx slice keeps only
matched rows; tail of weights was discarded with no warning).
- Undersized arrays surfaced raw NumPy IndexError instead of the
package's standard front-door ValueError.
Fix: validate weights as 1D AND len(weights) == len(data) BEFORE the
pos_idx slice, on all three staggered-weights entry points
(did_had_pretest_workflow, joint_pretrends_test,
joint_homogeneity_test). Subset only after validation passes.
R9 P3 -- did_had_pretest_workflow's Notes section still had the
un-split survey all_pass description ("at least one linearity test
conclusive"). The HADPretestReport docstring and REGISTRY were
updated in R8; this third location was missed. Now matches the
event-study contract that requires BOTH joint variants conclusive +
non-rejecting.
4 new regression tests in TestPhase45CR1Regressions:
- test_workflow_event_study_oversized_weights_raises
- test_workflow_event_study_undersized_weights_raises
- test_joint_pretrends_test_oversized_weights_raises
- test_joint_homogeneity_test_undersized_weights_raises
All assert the front-door ValueError contract instead of silent
truncation / raw IndexError. 191 pretest tests pass (was 187 after R8).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
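A sketch of the R9 validate-before-subset ordering (the helper name here is hypothetical; the shipped fix lives inline in the three entry points):

```python
import numpy as np
import pandas as pd

def validate_then_subset(data: pd.DataFrame, data_filtered: pd.DataFrame,
                         weights) -> np.ndarray:
    """Validate against the ORIGINAL panel length BEFORE positional
    subsetting, so oversized arrays raise instead of being silently
    truncated and undersized ones raise a front-door ValueError instead
    of a raw IndexError (illustrative only)."""
    weights = np.asarray(weights)
    if weights.ndim != 1:
        raise ValueError("weights must be 1-dimensional")
    if len(weights) != len(data):
        raise ValueError(
            f"weights has length {len(weights)}, expected {len(data)} "
            "(one entry per row of the original data)")
    pos_idx = data.index.get_indexer(data_filtered.index)
    return weights[pos_idx]

data = pd.DataFrame({"x": range(4)})
filtered = data.iloc[2:]
try:
    validate_then_subset(data, filtered, np.ones(6))   # oversized array
    err = None
except ValueError as e:
    err = str(e)
print(err is not None and "length 6" in err)  # -> True
```

With the check first, both failure modes surface as the package's standard ValueError; only a correctly sized array reaches the pos_idx slice.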
🔁 AI review rerun (requested by @igerber)

/ai-review
R10 P1 (Methodology) -- the Stute survey bootstrap was silently miscalibrated on stratified designs. The HAD sup-t bootstrap (had.py:2120+) applies a within-stratum demean + sqrt(n_h/(n_h-1)) small-sample correction AFTER generate_survey_multiplier_weights_batch returns, to make the bootstrap variance match the Binder-TSL stratified target. The same correction has NOT been derived for the Stute CvM functional, so applying the helper's raw multipliers directly to residual perturbations on stratified designs left the bootstrap p-value silently miscalibrated. Per the reviewer's offered "narrow support" path: Phase 4.5 C now explicitly rejects stratified designs on the Stute family with NotImplementedError. Pweight-only, PSU-only, and FPC-only designs remain supported (the helper's output is appropriately scaled for those without further correction). Stratified support is a follow-up, after the matching Stute-CvM stratified-correction derivation lands. Mirrors the lonely_psu='adjust' rejection pattern (R5 P1) -- both are methodology-gap-driven explicit NotImplementedErrors with a documented follow-up. The strata guard supersedes the lonely_psu='adjust' singleton-strata guard for any stratified design (the latter is now defense-in-depth for the unstratified residual case).
R10 P3 -- added a "stratified-design rejection" entry to the REGISTRY's Note (Phase 4.5 C). Also updated the CHANGELOG to narrow the documented survey contract.
Tests updated:
- test_stute_test_lonely_psu_adjust_singletons_raises -> test_stute_test_stratified_design_raises (the strata guard fires first; the test is still meaningful but keys on a strata match). Same renaming for the stute_joint_pretest variant.
- test_stute_test_lonely_psu_remove_singletons_returns_nan REMOVED (singleton strata under lonely_psu='remove' now hit the strata guard instead of the df_survey<=0 guard).
- test_joint_homogeneity_test_psu_strata_survey_smoke -> test_joint_homogeneity_test_psu_only_survey_smoke (positive coverage on a PSU-only design) + new test_joint_homogeneity_test_stratified_raises.
- test_workflow_event_study_psu_strata_survey_smoke -> test_workflow_event_study_psu_only_survey_smoke.
- test_workflow_event_study_survey_pass_does_not_say_inconclusive switched from strata to PSU-only.
191 pretest tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber)
Closes the Phase 4.5 C0 promise (PR #367 commit 29f8b12). Linearity-family pretests now accept survey= / weights= keyword-only kwargs on: stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, did_had_pretest_workflow.

Stute family: PSU-level Mammen multiplier bootstrap via generate_survey_multiplier_weights_batch. Each replicate draws (B, n_psu) Mammen multipliers, broadcasts them to per-obs perturbations eta_obs[g] = eta_psu[psu(g)], refits weighted OLS, and computes the weighted CvM via the new _cvm_statistic_weighted helper. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence (Delgado 1993; Escanciano 2006) AND the PSU clustering (Krieger-Pfeffermann 1997). NOT Rao-Wu rescaling -- the multiplier bootstrap is a different mechanism.

Yatchew: closed-form weighted OLS + pweight-sandwich variance components (no bootstrap):
    sigma2_lin  = sum(w * eps^2) / sum(w)
    sigma2_diff = sum(w_avg * diff^2) / (2 * sum(w))   [Reviewer CRITICAL #2]
    sigma4_W    = sum(w_avg * eps_g^2 * eps_{g-1}^2) / sum(w_avg)
    T_hr        = sqrt(sum(w)) * (sigma2_lin - sigma2_diff) / sqrt(sigma4_W)
where w_avg_g = (w_g + w_{g-1}) / 2 (Krieger-Pfeffermann 1997, Section 3). All three components reduce bit-exactly to the existing unweighted formulas at w = ones(G); locked at atol=1e-14 by a direct helper test.

Workflow under survey/weights: skips the QUG step with a UserWarning (per the C0 deferral), sets qug=None on the report, dispatches the linearity family with the survey-aware mechanism, and appends the "linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0" suffix to the verdict. all_pass drops the QUG-conclusiveness gate (one less precondition). HADPretestReport.qug is retyped from QUGTestResults to Optional[QUGTestResults]; summary/to_dict/to_dataframe updated to None-tolerant rendering.

Pweight shortcut routing: weights= passes through a synthetic trivial ResolvedSurveyDesign (new survey._make_trivial_resolved helper) so the same kernel handles both entry paths -- mirrors PR #363's R7 fix pattern on HAD sup-t.

Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise NotImplementedError at every entry point (defense in depth, reciprocal-guard discipline). The per-replicate weight-ratio rescaling for the OLS-on-residuals refit step is not covered by the multiplier-bootstrap composition; deferred to a parallel follow-up.

Per-row weights= / survey=col inputs are aggregated to per-unit via the existing HAD helpers (_aggregate_unit_weights, _aggregate_unit_resolved_survey; constant-within-unit invariant enforced) through the new _resolve_pretest_unit_weights helper. Strictly positive weights are required on Yatchew (the adjacent-difference variance is undefined under contiguous-zero blocks).

Stability invariants preserved:
- Unweighted code paths are bit-exact pre-PR (the new survey/weights branch is a separate if arm; the existing 138 pretest tests pass unchanged).
- Yatchew weighted variance components reduce to the unweighted ones at w=1 at atol=1e-14 (locked by TestYatchewHRTestSurvey).
- The HADPretestReport schema is bit-exact on the unweighted path; qug=None triggers the new None-tolerant rendering only on the survey path.

20 new tests across TestHADPretestWorkflowSurveyGuards (revised from C0 rejection-only to C functional; the 2 mutex/replicate-weight tests retained), TestStuteTestSurvey (7), TestYatchewHRTestSurvey (7), TestJointStuteSurvey (5). Full pretest suite: 158 tests pass. Patch-level addition (additive on stable surfaces). See docs/methodology/REGISTRY.md "QUG Null Test" -- Note (Phase 4.5 C) for the full methodology.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
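The PSU-level Mammen multiplier mechanism described above can be sketched as follows (illustrative: mammen_psu_multipliers is a stand-in for generate_survey_multiplier_weights_batch, and the toy psu map is invented):

```python
import numpy as np

def mammen_psu_multipliers(n_psu: int, B: int, rng) -> np.ndarray:
    """Draw a (B, n_psu) matrix of Mammen two-point multipliers.

    The two-point law takes value 1 - phi with probability phi/sqrt(5)
    and value phi with the remaining probability (phi the golden ratio),
    giving mean 0 and variance 1.
    """
    phi = (1 + np.sqrt(5)) / 2
    p_low = phi / np.sqrt(5)
    u = rng.random((B, n_psu))
    return np.where(u < p_low, 1 - phi, phi)

rng = np.random.default_rng(42)
psu_of_obs = np.array([0, 0, 1, 1, 2])   # psu(g) map, shared across horizons
eta_psu = mammen_psu_multipliers(n_psu=3, B=200, rng=rng)
# Broadcast PSU draws to per-obs perturbations: all observations in a PSU
# share one multiplier, which is what preserves the PSU clustering.
eta_obs = eta_psu[:, psu_of_obs]
print(eta_obs.shape)  # -> (200, 5)
# Loose Monte-Carlo sanity checks on the mean-0 / variance-1 law:
print(abs(eta_psu.mean()) < 0.15, abs(eta_psu.var() - 1) < 0.15)
```

For the joint variant, reusing the same eta_psu matrix for every horizon within a replicate is what carries the cross-horizon unit-level dependence through the bootstrap.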
R1 P0 — Stute survey path silently accepted zero-weight units, which leak into the dose-variation check + CvM cusum + bootstrap refit while contributing zero population mass. Extreme case: only zero-weight units carry dose variation -> spurious finite test statistic with no warning. Fix: strictly-positive guards on every survey-aware Stute / Yatchew / workflow entry point (the weights= shortcut already had this; survey= branch was the gap). R1 P1 #1 — aweight/fweight survey designs slipped through pweight-only formulas silently (the variance components are derived assuming pweight sandwich semantics). Fix: weight_type='pweight' guards added in _resolve_pretest_unit_weights and on every direct-helper survey= branch (stute_test, yatchew_hr_test, stute_joint_pretest). Mirrors HAD.fit guard at had.py:2976 + survey._resolve_pweight_only at survey.py:914. R1 P1 #2 — workflow's row-level weights= crashed on staggered event- study panels because _validate_multi_period_panel filters to last cohort but the joint wrappers re-aggregate with the original full- panel weights array. Fix: subset joint_weights to data_filtered's rows via data.index.get_indexer(data_filtered.index) BEFORE passing to the wrappers. Mirrors HeterogeneousAdoptionDiD.fit positional- index pattern. Survey= path is unaffected (column references resolve internally on data_filtered). R1 P3 — REGISTRY C0 note still said "the same gate applies to did_had_pretest_workflow" and "Phase 4.5 C uses Rao-Wu rescaling"; both are stale post-C. Updated to clarify (a) workflow gate was temporary and is now closed by C, (b) qug_test direct-helper gate remains permanent, (c) C uses PSU-level Mammen multiplier bootstrap (NOT Rao-Wu rescaling). 7 new tests in TestPhase45CR1Regressions covering: zero-weight survey on stute_test / stute_joint_pretest / workflow; aweight rejection on stute_test / workflow; fweight rejection on yatchew_hr_test; staggered event-study workflow with weights= (catches the length-mismatch crash). 
165 pretest tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R2 P1 #1 (Code Quality) -- joint_pretrends_test and joint_homogeneity_test direct calls still crashed on staggered panels because the staggered- weights subset fix from R1 was only applied at the workflow level. The wrappers run their own _validate_had_panel_event_study() and may filter to data_filtered, then passed the original full-panel weights array to _resolve_pretest_unit_weights(data_filtered, ...) which expects the filtered row count. Fix: subset row-level weights to data_filtered.index positions (via data.index.get_indexer) BEFORE _resolve_pretest_unit_weights, mirroring the workflow fix. R2 P1 #2 (Methodology) -- REGISTRY note documented the bootstrap perturbation as `dy_b = fitted + eps * w * eta_obs`, but the code does `dy_b = fitted + eps * eta_obs` (no `* w`). Code is correct: paper Appendix D wild-bootstrap perturbs UNWEIGHTED residuals; weighting flows through the OLS refit and the weighted CvM, not through the perturbation. Adding `* w` would over-weight by w². Fix: update REGISTRY note to remove the spurious `* w` and clarify the canonical form. Add a regression that pins (a) bit-exact cvm_stat reduction at uniform weights, (b) bootstrap p-value distributional agreement within Monte-Carlo noise. R2 P3 -- in-code docstrings still referenced the pre-Phase-4.5-C contract: - qug_test docstring said survey-aware Stute "admits a Rao-Wu rescaled bootstrap" (PSU-level Mammen multiplier bootstrap is what shipped). Updated to reflect the correct mechanism. - HADPretestReport.all_pass docstring described the unweighted contract only; survey/weights path drops the QUG-conclusiveness gate (linearity-conditional admissibility per C0 deferral). Updated. 
3 new regression tests in TestPhase45CR1Regressions: - test_joint_pretrends_test_staggered_weights_subset - test_joint_homogeneity_test_staggered_weights_subset - test_stute_survey_perturbation_does_not_double_weight (locks the perturbation form via cvm_stat bit-exact reduction + p-value MC bound) 168 pretest tests pass (was 165 after R1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R3 P0 (Methodology) -- variance-unidentified survey-design guard. generate_survey_multiplier_weights_batch returns an all-zero (B, n_psu) multiplier matrix when n_psu - n_strata <= 0 (e.g. unstratified single-PSU design where df_survey = n_psu - 1 = 0). Without a guard, the Stute survey path treated zero perturbations as a valid bootstrap law and emitted p_value = 1/(B+1) for any positive observed CvM (spurious rejection on a survey edge case). Fix: detect df_survey <= 0 BEFORE calling the multiplier helper and return a NaN p_value with a UserWarning, matching the broader package convention for variance-unidentified survey designs (mirrors compute_survey_vcov NaN treatment, ResolvedSurveyDesign.df_survey semantics). Applied in stute_test (single-horizon) and stute_joint_pretest (joint variant); the data-in wrappers and the workflow propagate via the same calls. R3 P3 -- in-code docstrings still drifted on three surfaces: - qug_test NotImplementedError text said survey-aware Stute is "planned" and "admits a Rao-Wu rescaled bootstrap"; both are stale post-Phase-4.5-C. Updated to reflect the shipped reality (PSU-level Mammen multiplier bootstrap; QUG step is skipped under survey/weights in the workflow). - joint_pretrends_test, joint_homogeneity_test, stute_joint_pretest Parameters sections did not document the new weights= / survey= kwargs at all. Added with the per-test contract (per-unit aggregation, staggered subsetting, replicate/aweight/fweight rejection, variance-unidentified guard). 3 new regression tests in TestPhase45CR1Regressions: - test_stute_test_single_psu_unstratified_returns_nan - test_stute_joint_pretest_single_psu_unstratified_returns_nan - test_workflow_single_psu_propagates_nan_through_stute (also documents Yatchew's PSU-agnostic-by-design behavior; verdict carries the linearity-conditional suffix) 171 pretest tests pass (was 168 after R2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R4 P0 (Methodology) -- Yatchew test statistic was not invariant to uniform pweight rescaling. The formula `T_hr = sqrt(sum(w)) * (...)` makes T_hr scale as sqrt(c) under weights -> w * c, so weights=w and weights=100*w produced different p-values for the same design. Worse, SurveyDesign.resolve() normalizes pweights to mean=1 internally, so the survey= entry path and the weights= shortcut disagreed numerically. Fix: normalize per-unit pweights to mean=1 at every helper entry (stute_test, yatchew_hr_test, stute_joint_pretest) and at the workflow resolution helper. Matches SurveyDesign.resolve() convention; makes the Yatchew statistic scale-invariant; ensures weights=w and survey=SurveyDesign(weights="w") produce identical results for the same design. Stute is internally scale-invariant in functional form but normalization is required so the bootstrap helper sees the same weight vector under both entry paths (cross-path numerical agreement). R4 P1 (Code Quality) -- column-vector weights (e.g. `df[["w"]].to_numpy()` producing (G, 1)) silently broadcast through weighted moments / CvM sums instead of raising. Fix: validate via `_validate_1d_numeric` on all `weights=` arrays in stute_test, yatchew_hr_test, stute_joint_pretest; add explicit ndim check in `_resolve_pretest_unit_weights` with a hint about the common df[["w"]].to_numpy() mistake. 6 new regression tests in TestPhase45CR1Regressions: - test_yatchew_weights_scale_invariant (weights=w vs weights=100*w) - test_stute_weights_scale_invariant (mirror for Stute) - test_workflow_weights_eq_survey_at_overall_path (weights= shortcut and survey=SurveyDesign(...) produce identical Yatchew + Stute results, atol=1e-10) - test_stute_test_rejects_2d_weights / test_yatchew_hr_test_rejects_2d_weights / test_workflow_rejects_2d_weights (column-vector rejection at all three direct-helper / workflow entry points) 177 pretest tests pass (was 171 after R3). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R5 P1 (Methodology) -- Stute survey df_survey<=0 guard was too broad.
generate_survey_multiplier_weights_batch's pooled-singleton branch
for lonely_psu='adjust' produces NONZERO multipliers, so the helper
contract is non-degenerate even when df_survey=0 (one PSU per
stratum). The previous guard rejected those designs as
variance-unidentified when actually the issue is methodological: the
analytical variance target requires a pseudo-stratum centering
transform that has not been derived for the Stute CvM.
Fix: mirror the HAD sup-t bootstrap's explicit lonely_psu='adjust'
rejection at had.py:2081-2118. Add new helper
_has_lonely_psu_adjust_singletons in had_pretests.py. stute_test and
stute_joint_pretest now reject lonely_psu='adjust' with singleton
strata via NotImplementedError BEFORE the df_survey<=0 guard. The
guard remains for genuinely degenerate designs (single-PSU
unstratified, or one-PSU-per-stratum under remove/certainty where
the helper returns all-zero multipliers).
R5 P2 -- new regressions cover lonely_psu='adjust' rejection on both
single-horizon and joint-Stute helpers; also a positive control
showing lonely_psu='remove' singletons still flow through the
existing df_survey-based NaN path. 3 new tests in
TestPhase45CR1Regressions. REGISTRY note clarifies the carve-out:
'adjust' with singletons raises NotImplementedError; 'remove' /
'certainty' produce all-zero singleton multipliers and return NaN
via the variance-unidentified guard.
180 pretest tests pass (was 177 after R4).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
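The singleton-stratum detection behind this guard reduces to checking whether any stratum contains exactly one distinct PSU; a simplified sketch of the `_has_lonely_psu_adjust_singletons` idea (the real helper inspects a resolved design object, so the flat-array signature here is an assumption):

```python
def has_singleton_strata(strata, psu):
    """Sketch: True when any stratum contains exactly one distinct PSU
    (a 'lonely PSU' stratum)."""
    psus_by_stratum = {}
    for s, p in zip(strata, psu):
        psus_by_stratum.setdefault(s, set()).add(p)
    return any(len(psus) == 1 for psus in psus_by_stratum.values())


# Stratum 2 has a single PSU -> lonely-PSU design.
assert has_singleton_strata([1, 1, 2, 2], [1, 2, 3, 3]) is True
assert has_singleton_strata([1, 1, 2, 2], [1, 2, 3, 4]) is False
```

Under lonely_psu='adjust' this condition triggers the NotImplementedError described above; under 'remove'/'certainty' the all-zero-multiplier case still falls through to the df_survey-based NaN guard.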
R6 P1 #1 (Code Quality) -- did_had_pretest_workflow eagerly resolved
weights/survey on the FULL panel before _validate_multi_period_panel
applied the staggered last-cohort filter. Because
_resolve_pretest_unit_weights enforces strictly-positive per-unit
weights / pweight type / etc. on whatever data it sees, zero or
otherwise-invalid weights on the soon-to-be-dropped cohort would
abort an otherwise-valid event-study run.
Fix: defer resolution to per-aggregate branches.
- Top-level: only the survey/weights mutex check + use_survey_path
presence detection (no resolution).
- Overall path: resolve weights/survey AFTER _validate_had_panel (no
cohort filtering on this path; the original data IS the panel).
- Event-study path: do NOT resolve at the workflow level. The joint
wrappers (joint_pretrends_test / joint_homogeneity_test) own
resolution and already see data_filtered (post staggered filter).
Row-level weights= passed through with the existing positional
subsetting (R1 P1 fix preserved).
R6 P1 #2 (Documentation/Tests) -- positive PSU/strata survey coverage
gap. Existing tests covered overall-workflow + trivial/no-PSU smokes;
the PSU-aware multiplier-bootstrap path (the core new methodology)
was unpinned for joint_homogeneity_test and the event-study workflow.
3 new regression tests in TestPhase45CR1Regressions:
- test_joint_homogeneity_test_psu_strata_survey_smoke (non-trivial
SurveyDesign(weights=, strata=, psu=) on the linearity wrapper).
- test_workflow_event_study_psu_strata_survey_smoke (full event-study
dispatch under PSU/strata clustering: validate_multi_period_panel +
resolve on data_filtered + pretrends_joint + homogeneity_joint).
- test_workflow_event_study_zero_weights_on_dropped_cohort (R6 P1 #1
fix regression: panel where the dropped early cohort has zero
weights succeeds on the surviving last cohort; pre-fix this crashed
with "weights must be strictly positive").
183 pretest tests pass (was 180 after R5).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R7 P0 (Methodology) -- weighted CvM was missing the outer measure
factor. Survey-weighted plug-in of Stute's CvM functional integrates
the squared cusum process against the survey-weighted EDF
F_hat_w = (1/W) sum_i w_i delta_{D_i}, which weights BOTH the inner
cusum AND the outer integration measure:
C_g = sum_{h <= g} w_h * eps_h
S_w = (1/W^2) * sum_g w_g * (C_g)^2
Earlier revisions used (1/W^2) * sum_g C_g^2 (no outer w_g) which is
a count-weighted-cusum / uniform-outer-measure functional and silently
misreports survey-weighted Stute statistics for non-uniform weights.
At w=ones(G) both forms reduce to (1/G^2) sum_g C_g^2; only non-uniform
weights distinguish them, which is why the prior reduction tests
didn't catch this.
Fix: add the outer w_sorted factor to _cvm_statistic_weighted. New
oracle test pins the formula on a hand-computed non-uniform-weight
example (w=[1,2,3], eps=[1,-2,3] -> 127/36 outer-weighted, NOT 46/36
count-weighted-cusum). Reduction at w=1 still bit-exact.
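A numerical sketch of the corrected functional on the hand-computed example above (standalone reimplementation for illustration; the shipped helper is `_cvm_statistic_weighted`):

```python
import numpy as np


def cvm_weighted(eps, w):
    """Weighted Stute CvM sketch: squared weighted cusum C_g integrated
    against the survey-weighted EDF, i.e. WITH the outer w_g factor."""
    W = w.sum()
    C = np.cumsum(w * eps)           # C_g = sum_{h <= g} w_h * eps_h
    return (w * C**2).sum() / W**2   # S_w = (1/W^2) sum_g w_g * C_g^2


w = np.array([1.0, 2.0, 3.0])
eps = np.array([1.0, -2.0, 3.0])
# C = [1, -3, 6], W = 6: outer-weighted (1*1 + 2*9 + 3*36)/36 = 127/36,
# versus the old count-weighted (1 + 9 + 36)/36 = 46/36.
assert np.isclose(cvm_weighted(eps, w), 127 / 36)

# Uniform weights: both forms reduce to (1/G^2) * sum_g C_g^2.
u = np.ones(3)
assert np.isclose(cvm_weighted(eps, u), (np.cumsum(eps) ** 2).sum() / 9)
```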
R7 P1 (Code Quality) -- survey verdict could leave pass cases starting
with "inconclusive". The previous approach composed the unweighted
verdict against a synthetic NaN QUG and string-replaced "QUG NaN" out;
when no rejections fired and linearity was conclusive, the underlying
"inconclusive - QUG NaN" template would collapse to "inconclusive"
even for all_pass=True paths.
Fix: explicit survey-aware verdict composers
_compose_verdict_overall_survey and _compose_verdict_event_study_survey
that drop QUG from consideration entirely and emit linearity-only
priority text. Both append the
"(linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0)"
suffix in every branch (rejections, inconclusive, fail-to-reject).
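The compose-instead-of-string-replace idea can be sketched as follows (hypothetical signature and verdict phrasing; the shipped helpers are `_compose_verdict_overall_survey` / `_compose_verdict_event_study_survey`):

```python
def compose_verdict_overall_survey(stute_rejects, yatchew_rejects,
                                   stute_conclusive, yatchew_conclusive):
    """Sketch: QUG never enters the composition, so a clean pass can
    never surface a leftover 'inconclusive - QUG NaN' template."""
    suffix = (" (linearity-conditional verdict; QUG-under-survey "
              "deferred per Phase 4.5 C0)")
    if (stute_conclusive and stute_rejects) or (yatchew_conclusive and yatchew_rejects):
        return "reject" + suffix
    if not (stute_conclusive or yatchew_conclusive):
        return "inconclusive" + suffix
    return "fail to reject" + suffix


# The all-pass path starts with 'fail to reject', never 'inconclusive'.
assert compose_verdict_overall_survey(False, False, True, True).startswith("fail to reject")
```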
4 new regression tests:
- test_cvm_statistic_weighted_outer_measure_oracle (R7 P0 hand-computed)
- test_cvm_statistic_weighted_reduces_at_uniform_weights (R7 P0 reduction)
- test_workflow_overall_survey_pass_does_not_say_inconclusive (R7 P1)
- test_workflow_event_study_survey_pass_does_not_say_inconclusive (R7 P1)
187 pretest tests pass (was 183 after R6).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R8 P3 -- survey all_pass docstring/REGISTRY described the contract too
generically ("at least one linearity test conclusive"), which matches
the overall path but is looser than the implemented event-study path
(which requires BOTH pretrends_joint AND homogeneity_joint conclusive
+ non-rejecting).
Fix: split the survey all_pass description by aggregate in both
HADPretestReport docstring and REGISTRY note:
- overall: "at least one of Stute/Yatchew conclusive + no rejection"
(mirrors paper Section 4 step-3 'Stute OR Yatchew' wording).
- event_study: "both joint variants conclusive + neither rejects"
(same step-2 + step-3 closure as the unweighted aggregate, minus the
QUG step).
Code unchanged; only documentation. 187 pretest tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R9 P1 (Code Quality) -- staggered weights= subsetting bypassed the
standard length check. The R6 P1 fix added pos_idx-based subsetting
to align row-level weights with data_filtered, but the subsetting
ran without first validating that weights had length matching the
ORIGINAL data:
- Oversized arrays were silently truncated (pos_idx slice keeps only
matched rows; tail of weights was discarded with no warning).
- Undersized arrays surfaced raw NumPy IndexError instead of the
package's standard front-door ValueError.
Fix: validate weights as 1D AND len(weights) == len(data) BEFORE the
pos_idx slice, on all three staggered-weights entry points
(did_had_pretest_workflow, joint_pretrends_test,
joint_homogeneity_test). Subset only after validation passes.
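The validate-then-subset pattern looks roughly like this (standalone sketch; the function name is hypothetical, but the `get_indexer` positional subsetting is the mechanism described above):

```python
import numpy as np
import pandas as pd


def subset_staggered_weights(weights, data, data_filtered):
    """Sketch: validate row-level weights against the ORIGINAL panel,
    then subset to the filtered (last-cohort) rows by positional
    index. Validation first prevents silent truncation of oversized
    arrays and raw IndexError on undersized ones."""
    weights = np.asarray(weights)
    if weights.ndim != 1:
        raise ValueError("weights must be 1D")
    if len(weights) != len(data):
        raise ValueError(
            f"weights has length {len(weights)} but data has {len(data)} rows"
        )
    pos_idx = data.index.get_indexer(data_filtered.index)
    return weights[pos_idx]


df = pd.DataFrame({"y": range(5)}, index=list("abcde"))
df_filtered = df.loc[["c", "d", "e"]]           # stand-in for the cohort filter
assert list(subset_staggered_weights(np.arange(5.0), df, df_filtered)) == [2.0, 3.0, 4.0]
```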
R9 P3 -- did_had_pretest_workflow's Notes section still had the
un-split survey all_pass description ("at least one linearity test
conclusive"). The HADPretestReport docstring and REGISTRY were
updated in R8; this third location was missed. Now matches the
event-study contract that requires BOTH joint variants conclusive +
non-rejecting.
4 new regression tests in TestPhase45CR1Regressions:
- test_workflow_event_study_oversized_weights_raises
- test_workflow_event_study_undersized_weights_raises
- test_joint_pretrends_test_oversized_weights_raises
- test_joint_homogeneity_test_undersized_weights_raises
All assert the front-door ValueError contract instead of silent
truncation / raw IndexError. 191 pretest tests pass (was 187 after R8).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R10 P1 (Methodology) -- Stute survey bootstrap was silently
miscalibrated on stratified designs. The HAD sup-t bootstrap
(had.py:2120+) applies a within-stratum demean + sqrt(n_h/(n_h-1))
small-sample correction AFTER generate_survey_multiplier_weights_batch
returns, to make the bootstrap variance match the Binder-TSL
stratified target. The same correction has NOT been derived for the
Stute CvM functional, so applying the helper's raw multipliers
directly to residual perturbations on stratified designs left the
bootstrap p-value silently miscalibrated.
Per the reviewer's offered "narrow support" path: Phase 4.5 C now
explicitly rejects stratified designs on the Stute family with
NotImplementedError. Pweight-only, PSU-only, and FPC-only designs
remain supported (the helper's output is appropriately scaled for
those without further correction). Stratified is a follow-up after
the matching Stute-CvM stratified-correction derivation lands.
Mirrors the lonely_psu='adjust' rejection pattern (R5 P1) — both are
methodology-gap-driven explicit NotImplementedErrors with documented
follow-up. The strata guard supersedes the lonely_psu='adjust'
singleton-strata guard for any stratified design (the latter is now
defense-in-depth for the unstratified residual case).
R10 P3 -- added a "stratified-design rejection" entry to REGISTRY's
Note (Phase 4.5 C). Also updated CHANGELOG to narrow the documented
survey contract.
Tests updated:
- test_stute_test_lonely_psu_adjust_singletons_raises ->
test_stute_test_stratified_design_raises (the strata guard fires
first; the test is still meaningful but on a strata key match).
- Same renaming for the stute_joint_pretest variant.
- test_stute_test_lonely_psu_remove_singletons_returns_nan REMOVED
(singleton strata under lonely_psu='remove' now hit the strata guard
instead of the df_survey<=0 guard).
- test_joint_homogeneity_test_psu_strata_survey_smoke ->
test_joint_homogeneity_test_psu_only_survey_smoke (positive coverage
on a PSU-only design) + new
test_joint_homogeneity_test_stratified_raises.
- test_workflow_event_study_psu_strata_survey_smoke ->
test_workflow_event_study_psu_only_survey_smoke.
- test_workflow_event_study_survey_pass_does_not_say_inconclusive
switched from strata to PSU-only.
191 pretest tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review

0aa1a40 to d8a7353
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
✅ Looks good — the prior P1s are addressed, and I did not find any new unmitigated P0/P1 issues in the changed methodology or inference code.

Executive Summary
Methodology
No unmitigated findings. The main prior methodology blocker is closed: the PR now rejects stratified Stute paths up front instead of applying raw survey multipliers on a target that the codebase itself says still lacks the matching Stute-specific correction (

Code Quality
No unmitigated findings. The previously flagged shape/length bug on staggered

Performance
No findings.

Maintainability
No unmitigated findings.

Tech Debt
Security
No findings.

Documentation/Tests

Runtime verification unavailable here because the environment does not have
R12 P3 #1 -- TODO row 98 said Phase 4.5 C ships "PSU/strata/FPC" but
R10 narrowed Stute-family support to pweight + PSU + FPC only
(stratified rejected with NotImplementedError pending derivation).
Updated to reflect the actual support surface and consolidated the
stratified-Stute follow-up alongside replicate-weight pretests as the
two known Phase 4.5 C follow-ups.
R12 P3 #2 -- the new survey test matrix covered pweight-only and
PSU-only smokes but no FPC-only case. The bootstrap helper applies
sqrt(1 - f) FPC scaling to multipliers under FPC, which was unpinned
by direct regression. 2 new positive smokes:
- test_stute_test_fpc_only_survey_smoke: direct helper with
ResolvedSurveyDesign(fpc=...) populated.
- test_workflow_overall_fpc_only_survey_smoke: workflow path with
SurveyDesign(weights=, fpc=) column reference.
193 pretest tests pass (was 191).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
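The sqrt(1 - f) FPC scaling pinned by those smokes is, in isolation, a one-liner; a sketch under assumed names (the shipped scaling lives inside generate_survey_multiplier_weights_batch, not as a public helper):

```python
import numpy as np


def fpc_scaled_multipliers(multipliers, sampling_fraction):
    """Sketch of finite-population-correction shrinkage: bootstrap
    perturbations shrink by sqrt(1 - f), approaching zero as the
    sampling fraction f approaches 1 (a census has no sampling
    variance)."""
    return np.sqrt(1.0 - sampling_fraction) * multipliers


m = np.array([1.0, -1.0, 2.0])
assert np.allclose(fpc_scaled_multipliers(m, 0.0), m)    # f=0: no correction
assert np.allclose(fpc_scaled_multipliers(m, 1.0), 0.0)  # f=1: zero variance
```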
/ai-review
🔁 AI review rerun (requested by @igerber) Head SHA:

Overall Assessment
✅ Looks good — the prior P1s appear addressed, and I did not find any new unmitigated P0/P1 issues in the changed survey-aware HAD pretest paths.

Executive Summary
Methodology
Code Quality
No findings. The changed code also avoids the specific anti-patterns called out in the review rubric: I did not find new inline generic SE/t-stat inference code or partial NaN-guard handling in the modified paths, and the new front-door weight validation/subsetting is consistent across the affected wrappers (

Performance
No findings. The PR keeps the existing literal refit/bootstrap structure; I did not see a new asymptotic regression beyond the already-tracked Stute refit debt in

Maintainability
No findings. The support surface is materially clearer now because unsupported designs fail at the entry points with method-specific messages, and the survey-only verdict composition was split into dedicated helpers instead of string post-processing (

Tech Debt
Security
No findings.

Documentation/Tests
No findings. The registry, changelog, and TODO now agree on the shipped support surface, and the earlier missing FPC-only coverage is present (
Summary
29f8b12). Linearity-family pretests now accept `weights=` / `survey=` kwargs.

- `bootstrap_utils.generate_survey_multiplier_weights_batch` (same kernel as the PR #363 HAD Phase 4.5 B sup-t bootstrap). Joint Stute SHARES the multiplier matrix across horizons, preserving vector-valued empirical-process unit-level dependence + PSU clustering.
- `w=ones(G)` (locked at `atol=1e-14`).
- `UserWarning` per C0 deferral, sets `qug=None` on report, dispatches survey-aware sub-tests, appends "linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0" suffix.
- `QUGTestResults` to `Optional[QUGTestResults]`; `summary`/`to_dict`/`to_dataframe` updated to None-tolerant rendering.
- `NotImplementedError` at every entry point (defense in depth) — parallel follow-up.

Methodology
Stute calibration (locked decision via plan-mode AskUserQuestion): PSU-level Mammen multipliers, NOT Rao-Wu rescaling — different mechanism. Reuses the kernel from PR #363's HAD event-study sup-t bootstrap. Per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`, weighted OLS refit, weighted CvM via the new `_cvm_statistic_weighted`. Joint Stute shares the `(B, n_psu)` matrix across horizons within each replicate.

Yatchew weighted variance components (Krieger-Pfeffermann 1997 §3 pair-weight convention):

- `sigma2_lin = sum(w·eps²) / sum(w)`
- `sigma2_diff = sum(w_avg·diff²) / (2·sum(w))` with `w_avg_g = (w_g + w_{g-1})/2`. The divisor uses `sum(w)` (= G at w=1), NOT `sum(w_avg)`, to match the existing `(1/(2G))` unweighted formula at had_pretests.py:1635 (Reviewer CRITICAL #2).
- `sigma4_W = sum(w_avg·prod) / sum(w_avg)` reduces to `(1/(G-1))·sum(prod)` at w=1.
- `T_hr = sqrt(sum(w))·(sigma2_lin - sigma2_diff)/sigma2_W` (effective-sample-size convention).

Trivial-survey-equivalence is DISTRIBUTIONAL (not bit-exact at `atol=1e-10`) — the survey path uses batched `generate_survey_multiplier_weights_batch` while the unweighted path uses per-iteration `_generate_mammen_weights`. Different RNG consumption ordering means same-seed runs produce different multiplier matrices, but the bootstrap p-value distributions agree at large B (Reviewer CRITICAL #3 reframe).

Stability invariants preserved

- Unweighted path unchanged (new `if` branch for survey/weights). All 138 existing pretest tests pass unchanged.
- Weighted statistics reduce to unweighted at `w=1` at `atol=1e-14` (locked by `TestYatchewHRTestSurvey::test_weighted_reduces_to_unweighted_at_uniform_weights`).
- `HADPretestReport` schema bit-exact on the unweighted path; `qug=None` triggers None-tolerant rendering only on the survey path.

Files
- `diff_diff/had_pretests.py`: `_fit_weighted_ols_intercept_slope` / `_cvm_statistic_weighted` / `_resolve_pretest_unit_weights` helpers + `HADPretestReport.qug` retyped to Optional
- `diff_diff/survey.py`: `_make_trivial_resolved` helper for pweight-shortcut routing
- `docs/methodology/REGISTRY.md`
- `tests/test_had_pretests.py`
- `CHANGELOG.md`
- `TODO.md`

Test plan
- `pytest tests/test_had_pretests.py -v` — 158/158 green (138 pre-PR + 20 new)
- `black diff_diff tests` — clean
- `ruff check diff_diff tests` — clean (auto-fixed import ordering)
- `stute_test(weights=)`, `yatchew_hr_test(weights=)`, `did_had_pretest_workflow(weights=)` all functional with QUG-skip warning

🤖 Generated with Claude Code